Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


used for the de novo assembly. For this exercise, we will use paired-end short reads produced by an Illumina MiSeq instrument for whole-genome sequencing of Escherichia coli (str. K-12). The forward and reverse FASTQ files are available from the NCBI SRA database.

First, using the Linux terminal, create a directory for the exercise and change to it.

mkdir denovo; cd denovo

Inside that directory, create a subdirectory for the FASTQ files.

mkdir fastq; cd fastq

Then, download the raw data files into the “fastq” directory using the fasterq-dump program from the SRA Toolkit, as follows:

fasterq-dump --threads 4 --verbose ERR1007381
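Once the download finishes, it can be worth confirming that it produced a matched pair of files. The following is a minimal sketch and not part of the SRA Toolkit: check_pair is a hypothetical helper that checks that both FASTQ files exist, that each line count is a multiple of four (one FASTQ record spans four lines), and that the two counts agree. The demonstration runs on two tiny made-up files so that it can be tried without the real data.

```shell
#!/usr/bin/env bash
# check_pair: hypothetical helper (not an SRA Toolkit command).
# Succeeds only if both FASTQ files exist and are non-empty, each holds
# whole 4-line records, and both hold the same number of records.
check_pair() {
    local fwd=$1 rev=$2 n_fwd n_rev
    [ -s "$fwd" ] && [ -s "$rev" ] || { echo "missing or empty file" >&2; return 1; }
    n_fwd=$(( $(wc -l < "$fwd") ))
    n_rev=$(( $(wc -l < "$rev") ))
    [ $((n_fwd % 4)) -eq 0 ] && [ $((n_rev % 4)) -eq 0 ] || { echo "incomplete record" >&2; return 1; }
    [ "$n_fwd" -eq "$n_rev" ] || { echo "read counts differ" >&2; return 1; }
    echo "$((n_fwd / 4)) read pairs"
}

# Toy data standing in for ERR1007381_1.fastq and ERR1007381_2.fastq.
demo=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$demo/toy_1.fastq"
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nTCAA\n+\nIIII\n' > "$demo/toy_2.fastq"
check_pair "$demo/toy_1.fastq" "$demo/toy_2.fastq"
```

On the real data, the same call would be check_pair ERR1007381_1.fastq ERR1007381_2.fastq, run from inside the “fastq” directory.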

To save storage space, you can compress the two FASTQ files with the gzip utility as follows:

gzip ERR1007381_1.fastq

gzip ERR1007381_2.fastq

Compression reduces the combined size of the two files from about 11 GB to about 3 GB.
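Compressed FASTQ files can still be inspected without unpacking them to disk. The sketch below relies only on standard gzip options (-t to test integrity, -c -d to decompress to standard output). The name count_reads_gz is a hypothetical helper, and the toy file stands in for the real ERR1007381_1.fastq.gz.

```shell
#!/usr/bin/env bash
# count_reads_gz: hypothetical helper; prints the number of FASTQ records
# in a gzip-compressed file, assuming four lines per record.
count_reads_gz() {
    gzip -t "$1" || return 1                  # stop if the archive is corrupt
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))  # lines / 4 = records
}

# Toy data: three made-up reads, compressed in place.
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCA\n+\nIIII\n' > "$demo/toy.fastq"
gzip "$demo/toy.fastq"                        # replaces toy.fastq with toy.fastq.gz
count_reads_gz "$demo/toy.fastq.gz"
```

Running the helper on both compressed files and comparing the two numbers is a quick way to confirm that the forward and reverse files still hold the same number of reads.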

We will use the “abyss-pe” command to perform the de novo genome assembly. Move up one level from the “fastq” directory to the main exercise directory with “cd ..”. Then, run the following command to construct contigs:

abyss-pe \
name=ecoli \
j=4 \
k=25 \
c=360 \
e=2 \
s=200 \
v=-v \
in='fastq/ERR1007381_1.fastq.gz fastq/ERR1007381_2.fastq.gz' \
contigs \
2>&1 | tee abyss.log
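After the run completes, a quick look at the contig file gives a first impression of the result. The following is a minimal sketch under stated assumptions: fasta_summary is a hypothetical helper written with awk, not an ABySS tool, and it assumes the contigs are written to a FASTA file named after the name=ecoli prefix (ecoli-contigs.fa is assumed here; check the directory listing if the name differs). The toy file below is made-up data used only to show the output format.

```shell
#!/usr/bin/env bash
# fasta_summary: hypothetical helper; prints the number of sequences,
# the total number of bases, and the longest sequence in a FASTA file.
# Sequences wrapped over several lines are handled.
fasta_summary() {
    awk '
        function tally() {
            if (len > 0) {
                total += len
                if (len > max) max = len
            }
            len = 0
        }
        /^>/ { tally(); n++; next }          # header line: close previous record
             { len += length($0) }           # sequence line: accumulate length
        END  { tally(); printf "sequences=%d total_bp=%d longest_bp=%d\n", n, total, max }
    ' "$1"
}

# Toy FASTA standing in for the assumed ecoli-contigs.fa.
demo=$(mktemp -d)
printf '>0\nACGTA\n>1\nACGTACGT\nACGT\n>2\nGGC\n' > "$demo/toy-contigs.fa"
fasta_summary "$demo/toy-contigs.fa"
```

For the toy file this prints sequences=3 total_bp=20 longest_bp=12. On the real assembly, the total should be compared with the expected E. coli genome size before moving on.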

The following will construct scaffolds from the contigs:

abyss-pe \

name=ecoli \

j=4 \

k=25 \

c=360 \